Sentiment analysis on user reviews helps to keep track of user reactions towards products and to advise users about what to buy. State-of-the-art review-level sentiment classification techniques can achieve precisions above 90%. However, current phrase-level sentiment analysis approaches typically give sentiment polarity labelling precisions of only around 70% to 80%, which is far from satisfactory and restricts their application in many practical tasks.

In this paper, we focus on the problem of phrase-level sentiment polarity labelling and attempt to bridge the gap between phrase-level and review-level sentiment analysis. We investigate the inconsistency between the numerical star ratings and the sentiment orientation of textual user reviews. Although the two have long been treated as identical, which serves as a basic assumption in previous work, we find that this assumption is not necessarily true.

We further propose to leverage the results of review-level sentiment classification to boost the performance of phrase-level polarity labelling using a novel constrained convex optimization framework. The framework is also capable of integrating various kinds of information sources and heuristics, while giving the global optimal solution due to its convexity. Experimental results on both English and Chinese reviews show that our framework achieves high labelling precisions of up to 89%, which is a significant improvement over current approaches.

1. Introduction 

Sentiment analysis techniques can be classified into three levels according to the granularity at which the analysis is conducted, i.e., document-level, sentence-level and phrase-level (Liu and Zhang 2012). When analyzing user reviews, document-level sentiment analysis is also referred to as review-level sentiment analysis. Figure 1 shows a user review of an Apple iPhone product, extracted from Amazon.com.
Technology Research and Development (863) Program (2011AA01A205) of China, and the third author is
sponsored by the National Science Foundation (IIS-0713111). The opinions, findings, suggestions or 
conclusions expressed in this paper are the authors', and do not necessarily reflect those of the sponsors. 

** This paper is an extended version of the work "Do users rate or review? Boost phrase-level sentiment labeling with review-level sentiment classification" in SIGIR'14.
Star Rating: ★ ★ ★ ★ ★

Review: I am very happy to have bought this phone from Amazon and the service rendered 
from the seller is excellent. Phone quality is perfect as new though I bought a used one. 

Care to their customers is something a key strategy the seller has followed. I would like to deal 
again with the same group in the near future and recommend to others highly. Thank you. 

Figure 1 

A sample user review towards an Apple iPhone 5 product from Amazon.com, which consists of 
a numerical star rating (5 stars here) and a piece of review text. The feature-opinion word pairs 
(service, excellent) and (phone quality, perfect) could be extracted from the review text to 
represent user attitudes (opinion words) towards specific product features (feature words). 


Review-level and sentence-level sentiment analysis attempt to determine the overall sentiment orientation of a review or sentence. In phrase-level sentiment analysis, however, we are particularly interested in those phrases that describe some features or aspects of products, which, in Figure 1 for example, are the phrases service and phone quality. The user used excellent to modify the product feature service, and perfect for phone quality. Both pairs express the user's positive sentiment on the corresponding feature. In this work, we use Feature word (F) to represent the words or phrases that describe specific product features, Opinion word (O) for the words or phrases expressing users' sentiments towards feature words, and Sentiment polarity (S) for the sentiment orientation of a Feature-Opinion (FO) pair.

The construction of a sentiment lexicon is of key importance in phrase-level sentiment analysis (Taboada et al. 2011; Liu, Hu, and Cheng 2005; Ding, Liu, and Yu 2008; Lu et al. 2011). Each entry in the lexicon is an FO pair together with the corresponding sentiment polarity, represented by (F, O, S) (Popescu and Etzioni 2005; Lu et al. 2011; Taboada et al. 2011). For example, the entries (service, excellent, positive) and (phone quality, perfect, positive) could be extracted from the review in Figure 1. The underlying reason for such approaches is the observation that the sentiment polarity of opinion words can be contextual (Wilson, Wiebe, and Hoffmann 2005; Lu et al. 2011), which means that the same opinion word can lead to different sentiment orientations when used to modify different feature words. For example, the opinion word high has a positive sentiment when modifying the feature word quality, yet has a negative sentiment when accompanied by noise.

However, current phrase-level sentiment lexicon construction approaches may only give sentiment polarity labelling (determining the S for an FO pair) precisions of around 70% to 80% (Liu and Zhang 2012; Wilson, Wiebe, and Hoffmann 2005; Liu, Hu, and Cheng 2005; Ding, Liu, and Yu 2008). Although the literature has shown that the overall sentiment classification precision for a whole sentence or review can be reasonably high even when the sentiment lexicon is not that accurate (Cui, Mittal, and Datar 2006; Turney 2002; Liu, Hu, and Cheng 2005; Ding, Liu, and Yu 2008), we argue that the problem of constructing an accurate sentiment lexicon is important in itself, because the use of a sentiment lexicon is not limited to aggregating the overall sentiment of sentences or reviews. In fact, it can be used in many promising tasks, such as word-of-mouth tracking of brands, tools for product design and optimization, and feature-level product search or recommendation.

Previous work on phrase-level sentiment analysis and lexicon construction (Lu et al. 2011; Lu, Zhai, and Sundaresan 2009; Dave, Lawrence, and Pennock 2003) assumes that the accompanying user rating indicates the overall sentiment orientation of a review text. However, our experiments on user rating analysis suggest that star ratings might not be a reliable signal for this task: a substantial number of users tend to give similar or even identical ratings continuously, regardless of the review text they write about a specific product.

In this paper, we instead propose to boost phrase-level sentiment polarity labelling in the reverse direction, namely, to use review-level sentiment classification results as a heuristic for phrase-level polarity labelling. State-of-the-art review-level sentiment classification techniques, even unsupervised ones, can give precisions above 90% (Liu and Zhang 2012; Lin, He, and Everson 2010; Zagibalov and Carroll 2008a; Qiu et al. 2009; Zagibalov and Carroll 2008b), which could be reliable enough to help boost the performance of phrase-level sentiment polarity labelling. Here, we focus on unsupervised review-level sentiment classification techniques because they need no manually annotated training data, which makes them domain-independent. Besides, they can achieve comparable or even better performance compared with supervised approaches, especially in scenarios such as online product reviews (Liu and Zhang 2012; Lin, He, and Everson 2010).

We design a two-stage process for phrase-level polarity labelling. In the first stage, 
the overall sentiment orientations of the product reviews in the corpus are labeled using 
a review-level sentiment classifier. In the second stage, we extract feature-opinion pairs 
from the corpus, then use the overall sentiment orientations of reviews as constraints 
to learn the sentiment polarities of these pairs automatically using a novel optimization 
framework. Experimental results on both English and Chinese review datasets show 
that our framework improves the precision of phrase-level sentiment polarity labelling 
significantly, which means that it might be promising to leverage sentence- or review- 
level sentiment analysis techniques to boost the performance of phrase-level sentiment 
analysis tasks. The main contributions of this paper are as follows: 


• We investigate the phenomenon of inconsistency between the numerical
star ratings and the sentiment polarities of textual user reviews through
extensive experimental studies.

• We propose to leverage review-level sentiment analysis techniques to 
boost the performance of phrase-level sentiment polarity labelling in 
sentiment lexicon construction tasks. 

• We formally define the problem of phrase-level sentiment polarity 
labelling as a constrained convex optimization problem and design 
iterative optimization algorithms for model learning, where the global 
optimal solution is guaranteed. 

• Through a comprehensive experimental study on both English and 
Chinese datasets, the effectiveness of the proposed framework is verified. 

The remainder of this paper is structured as follows. Section 2 reviews related work, and Section 3 formally defines the problem that we investigate. In Section 4 we propose our phrase-level sentiment polarity labelling framework, followed by the experimental results in Section 5. We conclude the work in Section 6.

2. Related work 


With the rapid growth of e-commerce, social networks and online discussion forums, the web has been rich in user-generated free-text data, where users express various attitudes towards products or events, which have been attracting researchers into Sentiment Analysis (Liu and Zhang 2012; Pang and Lee 2008). Sentiment analysis plays an important role in many applications, including opinion retrieval (Orimaye, Alhashmi, and Siew 2013), word-of-mouth tracking (Jansen et al. 2009), and opinion oriented document summarization (Liu, Seneff, and Zue 2010; Hu and Liu 2004), etc.

One of the core tasks in sentiment analysis is to determine the sentiment orientations that users express in reviews, sentences or on specific product features, corresponding to review(document)-level (Pang, Lee, and Vaithyanathan 2002), sentence-level (Wiebe, Wilson, and Cardie 2005; Nakagawa, Inui, and Kurohashi 2010) and phrase-level (Wilson, Wiebe, and Hoffmann 2005; Lu et al. 2011; Ding, Liu, and Yu 2008) sentiment analysis.

Review- and sentence-level sentiment analysis attempt to label a review or sentence as one of some predefined sentiment polarities, which are, typically, positive, negative and sometimes neutral (Liu and Zhang 2012). This task is referred to as Sentiment Classification, which has drawn much attention from the research community, and both supervised (Pang, Lee, and Vaithyanathan 2002; Mullen and Collier 2004; Yessenalina, Yue, and Cardie 2010; Cui, Mittal, and Datar 2006; Bickerstaffe and Zukerman 2010; Maas et al. 2011), unsupervised (Turney 2002; Hu and Liu 2004; Lin, He, and Everson 2010; Zagibalov and Carroll 2008a; Qiu et al. 2009; Zagibalov and Carroll 2008b) or semi-supervised (Dasgupta and Ng 2009; Zhou, Chen, and Wang 2010; Li et al. 2011; Goldberg and Zhu 2006) methods have been investigated.

Phrase-level sentiment analysis aims to analyze the sentiment expressed by users in a finer-grained granularity. It considers the sentiment expressed on specific product features or aspects (Hu and Liu 2004). Perhaps one of the most important tasks in phrase-level sentiment analysis is the construction of Sentiment Lexicon (Taboada et al. 2011; Liu, Hu, and Cheng 2005; Ding, Liu, and Yu 2008; Lu et al. 2011; Zhang et al. 2014b), which is to extract feature-opinion word pairs and their corresponding sentiment polarities from these opinion rich user-generated free-texts. The construction of a high-quality sentiment lexicon would benefit various tasks, for example, personalized recommendation (Zhang et al. 2014a, 2015; Zhang 2015) and automatic review summarization (Hu and Liu 2004; Taboada et al. 2011).

Although some opinion words like "good" or "bad" usually express consistent sentiments in different cases, many others might have different sentiment polarities when accompanied by different feature words, which means that the sentiment lexicon is contextual (Wilson, Wiebe, and Hoffmann 2005).

Various information and heuristics could be used in the process of polarity labelling of the feature-opinion pairs. For example, it is often assumed that the overall sentiment orientation of a review is aggregated from all the feature-opinion pairs in it (Ding, Liu, and Yu 2008; Lu et al. 2011). Besides, some seed opinion words that express "fixed" sentiments are usually provided, which are used to propagate the sentiment polarities of the other words (Hu and Liu 2004; Lu et al. 2011). Some work takes advantage of linguistic heuristics (Liu 2010; Hu et al. 2013; Lu et al. 2011). For example, two feature-opinion pairs concatenated with the conjunctive "and" might have the same sentiment, while they might have opposite sentiments if connected by "but". The assumption of linguistic heuristic is further extended by sentential sentiment consistency in (Kanayama and Nasukawa 2006).


In this paper, we consider two main disadvantages of previous work. First, seldom [...] (Ding, Liu, and Yu 2008; Liu, Hu, and Cheng 2005; Hu and Liu 2004; Wilson, Wiebe, and Hoffmann 2005). Second, they simply use the numerical star rating as the overall sentiment polarity of the review text to supervise the process of phrase-level polarity labelling (Lu et al. 2011; Ding, Liu, and Yu 2008; Lu, Zhai, and Sundaresan 2009; Dave, Lawrence, and Pennock 2003). In this work, we propose to boost phrase-level polarity labelling with review-level sentiment classification, while incorporating many of the commonly used heuristics in a unified framework.

Yongfeng Zhang et al.

Boost Phrase-level with Review-level Sentiment Analysis

3. Problem Formalization 


In this section, we formalize the problems to be investigated, as well as the notations to 
be used in this paper. 

Definition (Sentiment Vector) Suppose we are considering $r$ sentiment polarity labels $S_1, S_2, \dots, S_r$, for example, positive and negative when $r = 2$. A sentiment vector $\mathbf{x} = [x_1, x_2, \dots, x_r]^T$ ($x_i \ge 0$) represents the sentiment orientation of a review, sentence or feature-opinion pair. The $i$-th element $x_i$ in $\mathbf{x}$ indicates the extent of sentiment on the $i$-th polarity label $S_i$. A function $s: \mathbb{R}^r \to \mathbb{R}$ is defined on sentiment vector $\mathbf{x}$ such that $s(\mathbf{x})$ represents the overall sentiment of the review, sentence or feature-opinion pair.


For example, if $S_1$ = positive and $S_2$ = negative when $r = 2$, a sentiment vector $\mathbf{x} = [1,0]^T$ of a review is used to indicate that the review has a sentiment orientation on positive and no sentiment orientation on negative. If the function $s(\mathbf{x}) = x_1 - x_2$ is defined, then the overall sentiment orientation of the review is 1. Some previous work (Lu et al. 2011; Hu et al. 2013) enforces $x_i \le 1$ as an additional constraint; however, this constraint is not necessary in our framework.


Definition (Sentiment Matrix) For a set of $m$ reviews, sentences or feature-opinion pairs $t_1, t_2, \dots, t_m$, the $m \times r$ matrix $X = [\mathbf{x}_1\ \mathbf{x}_2 \cdots \mathbf{x}_m]^T$ is used to represent their sentiment orientations, where $\mathbf{x}_i$ is the sentiment vector for $t_i$, and $s(X) = [s(\mathbf{x}_1)\ s(\mathbf{x}_2) \cdots s(\mathbf{x}_m)]^T$ is their overall sentiment orientation.
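
The two definitions above can be illustrated with a minimal Python sketch. It assumes $r = 2$ with the labels (positive, negative) and the example function $s(\mathbf{x}) = x_1 - x_2$ mentioned in the text; the function names `overall_sentiment` and `overall_sentiments` are illustrative only and do not come from the paper.

```python
from typing import List

# A sentiment vector for r = 2: [extent of positive, extent of negative].
SentimentVector = List[float]


def overall_sentiment(x: SentimentVector) -> float:
    """Example choice s(x) = x_1 - x_2 for r = 2 (an assumption of this sketch)."""
    if any(v < 0 for v in x):
        raise ValueError("sentiment vector entries must be non-negative")
    return x[0] - x[1]


def overall_sentiments(X: List[SentimentVector]) -> List[float]:
    """Apply s row by row to an m x r sentiment matrix."""
    return [overall_sentiment(row) for row in X]


# Three items: positive, negative, and unknown.
X = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(overall_sentiments(X))  # [1.0, -1.0, 0.0]
```

Other choices of $s$ are possible; the sketch only fixes one so that the matrix form $s(X)$ is concrete.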


Definition (Review-Level Sentiment Classification) Given a review corpus $T$ of $m$ user reviews $t_1, t_2, \dots, t_m$, a review-level sentiment classification algorithm $C$ gives $C(T) = X_{m \times r} = [\mathbf{x}_1\ \mathbf{x}_2 \cdots \mathbf{x}_m]^T = [C(t_1)\ C(t_2) \cdots C(t_m)]^T$, where $\mathbf{x}_i = C(t_i)$ is the sentiment vector for the $i$-th user review $t_i$ given by $C$.

Note that, in real applications, the review-level sentiment classification algorithm $C$ could be either supervised or unsupervised. In this work, we consider unsupervised algorithms primarily to avoid the expensive manual labelling process and make our framework domain-independent.

Definition (Sentiment Lexicon) A sentiment lexicon consisting of $n$ FO pairs $FO_1, FO_2, \dots, FO_n$ is defined as an $n \times r$ sentiment matrix $X = [\mathbf{x}_1\ \mathbf{x}_2 \cdots \mathbf{x}_n]^T$, where $\mathbf{x}_i$ is the sentiment vector for the pair $FO_i$.

As stated before, the same opinion word may express different sentiment orientations when accompanied by different feature words, which makes a sentiment lexicon contextual. In this work, we use General Sentiment Lexicon and Contextual Sentiment Lexicon to indicate the two different kinds of sentiment lexicons. In the general sentiment lexicon, the sentiment vector of an FO pair is labelled according to its opinion word directly. For example, the opinion word excellent usually expresses a positive opinion regardless of the feature word; as a result, the sentiment vector for (service, excellent) will be labelled as $[1,0]^T$ directly. If the sentiment orientation of an opinion word is unknown, then the corresponding sentiment vector will be labelled as $[0,0]^T$. This lexicon is of high precision but low coverage. In the contextual sentiment lexicon, however, the sentiment vectors of FO pairs are labelled by considering both the feature words and the opinion words.

Definition (Phrase-Level Sentiment Polarity Labelling) In the process of contextual sentiment lexicon construction, once the feature-opinion pairs have been extracted from the review corpus, an important task is to determine the sentiment polarity of each FO pair, which is referred to as phrase-level sentiment polarity labelling. More formally, the task is to determine the sentiment matrix $X_{n \times r} = [\mathbf{x}_1\ \mathbf{x}_2 \cdots \mathbf{x}_n]^T$, which is further transformed into the overall sentiment orientations $s(X) = [s(\mathbf{x}_1)\ s(\mathbf{x}_2) \cdots s(\mathbf{x}_n)]^T$.

In this work, the only inputs of our framework are a user review corpus $T$ and a general sentiment lexicon $X_0$, and the expected output is a contextual sentiment lexicon $X$, as well as the overall sentiment orientations $s(X)$. The general sentiment lexicon $X_0$ is obtained from publicly available opinion word sets (see footnote 1). These word sets contain simple and frequently used opinion words such as excellent, good, bad, etc., and they are commonly viewed as basic background knowledge in natural language processing tasks.


4. The Framework 

In general, the framework has two stages. In the first stage, we use an unsupervised review-level sentiment classification algorithm $C$ to get the review-level sentiment matrix $\tilde{X} = C(T)$ for the review corpus $T$; in the second stage, we extract feature-opinion pairs from corpus $T$ and leverage the review-level sentiment matrix $\tilde{X}$ as well as the general sentiment lexicon $X_0$ in a unified optimization framework to obtain the contextual sentiment lexicon $X$. After that, a pre-defined function $s(X)$ is used to determine the overall sentiment orientation of each feature-opinion pair. We introduce the details of the framework in the rest of this section.

4.1 Review-Level Sentiment Classification 


The goal of this step is to obtain the overall sentiment polarities of user reviews with a sentiment classification algorithm; these polarities are then used to supervise the phrase-level sentiment polarity labelling process in the next stage.

We choose unsupervised review-level sentiment classification algorithms for three main reasons. The first is to avoid the manual labelling of review-level sentiment polarities, which makes our framework domain adaptable. The second is to keep our framework as general as possible and independent of specific data requirements. For example, the optimization framework in (Lu et al. 2011) takes advantage of numerical ratings given by users in online shopping or review service websites to supervise the process of phrase-level sentiment polarity labelling. However, such numerical ratings might not exist in some environments, for example, in forum discussions, emails and newsgroups (Liu and Zhang 2012). Finally, the relationship between the sentiment orientations of reviews and user ratings is still an open question (Moghaddam and Ester 2013; Liu and Zhang 2012). In fact, our experiments on user rating analysis will show that numerical ratings do not necessarily indicate the sentiment orientation of textual reviews.

1 We choose the commonly used MPQA sentiment corpus for English reviews and use HowNet for Chinese reviews. They will be formally introduced in the following part.

Classifying the sentiment orientations of user reviews into neutral or no opinion is usually ignored in most work, as pointed out in (Liu and Zhang 2012) and (Moghaddam and Ester 2013), for it is extremely hard and might negatively affect positive and negative sentiment classification. As a result, we choose two-class classification frameworks for review-level sentiment classification; namely, a review is classified as either positive or negative. More formally, the dimensionality of a sentiment vector is $r = 2$, with the labels $S_1$ = positive and $S_2$ = negative, correspondingly.

Two possible sentiment vector candidates are used in this stage. If a review is classified as positive by a sentiment classification algorithm $C$, then its sentiment vector is assigned as $\mathbf{x} = [1,0]^T$. Otherwise, the corresponding sentiment vector is set to $\mathbf{x} = [0,1]^T$. Based on the classification results, a review-level sentiment matrix $\tilde{X}$ is constructed:

$$\tilde{X} = [\mathbf{x}_1\ \mathbf{x}_2 \cdots \mathbf{x}_m]^T \tag{1}$$

where $\mathbf{x}_i$ is the sentiment vector of the $i$-th review in corpus $T$. $\tilde{X}$ will be used as a constraint in the next stage.

We use the sentence orientation prediction approach in (Hu and Liu 2004) for English reviews. In this approach, a small number (around 30) of seed opinion words are manually selected to construct the positive word set and the negative word set. For example, words such as great, fantastic, nice and cool are in the positive word set, and words like bad and dull are in the negative word set. After that, each of the two word sets is expanded by adding the synonyms of its own words and the antonyms of the words from the other set, where the synonyms and antonyms are defined in WordNet. Finally, the positive and negative word sets are used to aggregate the overall orientations of reviews.
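
The seed expansion and aggregation steps described above can be sketched as follows. This is only an illustration under stated assumptions: the small `SYNONYMS` and `ANTONYMS` dictionaries are hypothetical stand-ins for WordNet lookups, a single expansion pass is performed, and the aggregation rule (count positive words minus negative words, with ties sent to the negative class) is a simplification rather than the exact procedure of (Hu and Liu 2004).

```python
from typing import Dict, List, Set, Tuple

# Toy stand-ins for WordNet lookups (hypothetical data for illustration only).
SYNONYMS: Dict[str, Set[str]] = {"great": {"excellent"}, "bad": {"poor"}}
ANTONYMS: Dict[str, Set[str]] = {"nice": {"nasty"}, "dull": {"lively"}}


def expand(pos_seeds: Set[str], neg_seeds: Set[str]) -> Tuple[Set[str], Set[str]]:
    """One expansion pass: add a set's own synonyms and the other set's antonyms."""
    pos, neg = set(pos_seeds), set(neg_seeds)
    for w in pos_seeds:
        pos |= SYNONYMS.get(w, set())
        neg |= ANTONYMS.get(w, set())
    for w in neg_seeds:
        neg |= SYNONYMS.get(w, set())
        pos |= ANTONYMS.get(w, set())
    return pos, neg


def classify(review: str, pos: Set[str], neg: Set[str]) -> List[int]:
    """Return the sentiment vector [1, 0] for a positive review, [0, 1] otherwise."""
    tokens = [t.strip(".,!?").lower() for t in review.split()]
    score = sum(t in pos for t in tokens) - sum(t in neg for t in tokens)
    # Ties go to the negative class; this is an arbitrary choice of the sketch.
    return [1, 0] if score > 0 else [0, 1]


pos, neg = expand({"great", "nice"}, {"bad", "dull"})
reviews = [
    "The service is excellent and the screen is nice.",
    "Poor battery, dull display.",
]
X_tilde = [classify(r, pos, neg) for r in reviews]
print(X_tilde)  # [[1, 0], [0, 1]]
```

The rows of `X_tilde` play the role of the review-level sentiment vectors used as constraints in the next stage.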

We use the automatic seed word selection scheme in (Zagibalov and Carroll 2008a) for sentiment classification of Chinese reviews. It runs on Chinese characters directly and does not require pre-segmentation. In this scheme, the positive and negative seed words are selected automatically by taking advantage of negation words, and are then used to aggregate the overall sentiment of a review.

Both methods achieve high sentiment classification accuracies of around 90%, especially on some specific domains of product reviews. For example, the precision on digital camera and mobile phone reviews can be up to 92% in English corpora and 93% in Chinese corpora, which is state-of-the-art performance in sentiment classification of English and Chinese texts.


4.2 Sentiment Lexicon Construction 

In this stage, we construct the contextual sentiment lexicon. We generate the feature- 
opinion pairs first, and label their polarities in a unified optimization framework. 

4.2.1 Generation of Feature-Opinion Pairs. 

Each of the entries in the contextual sentiment lexicon is a feature-opinion pair. The feature word describes an aspect of a product, and the opinion word expresses the user's sentiment on the corresponding feature. In this stage, we first extract feature words from reviews, and then extract opinion words to pair with their corresponding feature words.

Computational Linguistics

Volume 1, Number 1

Feature words extraction 

We extract feature word candidates first, and then filter out the wrong words using 
a PMI-based filtering approach. 

We use the Stanford Parser (M. Marneffe 2006; Levy and Manning 2003; Chang et al. 2009) to conduct part-of-speech tagging, morphological analysis and grammatical analysis for both English and Chinese reviews. A sentence is converted into a dependency tree after parsing, which contains both the part-of-speech tagging results and grammatical relationships. The following example shows the dependency tree constructed for the review sentence "Phone quality is perfect and the service is excellent."


(ROOT
  (S
    (S
      (NP (JJ Phone) (NN quality))
      (VP (VBZ is)
        (ADJP (JJ perfect))))
    (CC and)
    (S
      (NP (DT the) (NN service))
      (VP (VBZ is)
        (ADJP (JJ excellent))))
    (. .)))


We extract the Noun Phrases (NP) and retain those whose frequency is greater than an experimentally set threshold. These phrases are treated as feature word candidates. E.g., the phrases phone quality and the service could be extracted.
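
As a rough illustration of this candidate extraction step, the sketch below reads bracketed parser output of the kind shown above, collects the noun phrases, and keeps those whose frequency exceeds a threshold. It assumes the parses are already available as strings; the helper names (`parse`, `noun_phrases`, `candidates`) and the threshold value are hypothetical, and no Stanford Parser API is called.

```python
from collections import Counter
from typing import List, Union

Tree = Union[str, list]  # a leaf token, or [label, child, child, ...]


def parse(s: str) -> Tree:
    """Parse a bracketed parse string into nested lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Tree:
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # skip the closing bracket
            return node
        tok = tokens[pos]
        pos += 1
        return tok

    return read()


def leaves(t: Tree) -> List[str]:
    """Collect the words under a node, skipping the node label."""
    if isinstance(t, str):
        return [t]
    return [w for child in t[1:] for w in leaves(child)]


def noun_phrases(t: Tree) -> List[str]:
    """Return the lower-cased text of every NP node in the tree."""
    if isinstance(t, str):
        return []
    found = [" ".join(leaves(t)).lower()] if t[0] == "NP" else []
    for child in t[1:]:
        found.extend(noun_phrases(child))
    return found


def candidates(parses: List[str], min_freq: int) -> List[str]:
    """Keep noun phrases whose corpus frequency is greater than min_freq."""
    counts = Counter(p for s in parses for p in noun_phrases(parse(s)))
    return sorted(p for p, c in counts.items() if c > min_freq)


parses = [
    "(ROOT (S (S (NP (JJ Phone) (NN quality)) (VP (VBZ is) (ADJP (JJ perfect))))"
    " (CC and) (S (NP (DT the) (NN service)) (VP (VBZ is) (ADJP (JJ excellent)))) (. .)))",
    "(ROOT (S (NP (DT The) (NN service)) (VP (VBZ is) (ADJP (JJ slow)))))",
]
print(noun_phrases(parse(parses[0])))    # ['phone quality', 'the service']
print(candidates(parses, min_freq=1))    # ['the service']
```

In practice the threshold is set experimentally, as the text notes; the value 1 here only serves the two-sentence toy corpus.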

The filtering process is then performed on the candidates in a similar way to that in (Popescu and Etzioni 2005). We first compute the Pointwise Mutual Information (PMI) of these feature word candidates with some predefined discriminator phrases (e.g., in the domain of cellphone the discriminator phrases are "of phone", "phone has", "phone comes with", etc.). The PMI of two phrases $p_1$ and $p_2$ is defined as:

$$\mathrm{PMI}(p_1, p_2) = \frac{\mathrm{Freq}(p_1, p_2)}{\mathrm{Freq}(p_1) \cdot \mathrm{Freq}(p_2)} \tag{2}$$

where $\mathrm{Freq}(p)$ indicates the total term frequency of a phrase $p$ in all the user reviews, and $\mathrm{Freq}(p_1, p_2)$ is the frequency with which $p_1$ and $p_2$ co-occur in a subsentence. We use subsentences instead of sentences when computing PMI because it is often the case that a sentence covers different aspects in several subsentences, as stated in (Lu et al. 2011). A feature word candidate is retained if its average PMI across all discriminator phrases is greater than an experimentally set threshold.
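
A minimal sketch of this filter is given below, following the ratio in equation (2). Several details are assumptions of the sketch rather than the paper's implementation: subsentences are obtained by splitting on punctuation, phrase occurrences are counted by plain substring matching, and the threshold of 0.1 is arbitrary.

```python
import re
from typing import List


def subsentences(review: str) -> List[str]:
    """Split a review on clause punctuation (a simplifying assumption)."""
    return [s.strip().lower() for s in re.split(r"[.,;!?]", review) if s.strip()]


def freq(phrase: str, subs: List[str]) -> int:
    """Total occurrences of a phrase, counted by substring matching."""
    return sum(s.count(phrase) for s in subs)


def cofreq(p1: str, p2: str, subs: List[str]) -> int:
    """Number of subsentences in which both phrases occur."""
    return sum(1 for s in subs if p1 in s and p2 in s)


def pmi(p1: str, p2: str, subs: List[str]) -> float:
    """Freq(p1, p2) / (Freq(p1) * Freq(p2)), or 0.0 when a phrase never occurs."""
    denom = freq(p1, subs) * freq(p2, subs)
    return cofreq(p1, p2, subs) / denom if denom else 0.0


def filter_candidates(cands: List[str], discriminators: List[str],
                      subs: List[str], threshold: float) -> List[str]:
    """Keep candidates whose average PMI over all discriminators exceeds the threshold."""
    kept = []
    for c in cands:
        avg = sum(pmi(c, d, subs) for d in discriminators) / len(discriminators)
        if avg > threshold:
            kept.append(c)
    return kept


reviews = [
    "The screen of phone is bright, the phone has a good screen.",
    "My friend likes the color.",
]
subs = [s for r in reviews for s in subsentences(r)]
discriminators = ["of phone", "phone has"]
print(filter_candidates(["screen", "friend"], discriminators, subs, 0.1))  # ['screen']
```

Here "screen" co-occurs with both discriminator phrases and is retained, while "friend" never does and is filtered out.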

Opinion words extraction 

We attempt to extract opinion words from user reviews and assign them to appropriate feature words to construct feature-opinion pairs, which serve as the entries in the sentiment lexicon. For English reviews, we extract the Adjective Phrases (ADJP) that co-occur with a feature word in a subsentence as opinion word candidates, and pair each with the corresponding feature word to form a feature-opinion pair candidate. For example, the adjective phrase perfect could be extracted as an opinion word, as it co-occurs with the feature word phone quality in a subsentence. The Chinese reviews are processed in the same manner to extract opinion word candidates, except that Verb Phrases (VP) are also taken into consideration besides adjective phrases.

For each of the feature-opinion pair candidates, we compute its Co-Occur Ratio (COR) in terms of the corresponding feature word. The co-occur ratio of a feature word $f$ and an opinion word $o$ is defined in the following way:

$$\mathrm{COR}(f, o) = \frac{\mathrm{Freq}(f, o)}{\mathrm{Freq}(f)} \tag{3}$$

The notations are the same as those in equation (2). Intuitively, a high COR score means that an opinion word candidate is frequently used to modify the corresponding feature word, which indicates that they are more likely to form a feature-opinion pair.

We also use an experimentally set threshold of COR to filter the feature-opinion pair candidates, and the retained pairs constitute the entries in the lexicon. It is possible to employ other techniques to construct the lexicon entries, but we choose a relatively simple approach so as to focus on the next step of sentiment polarity labelling. We adopt the unified framework based on Finite State Matching Machine described in (Tan et al. 2013) to locate the matched feature-opinion pairs in each review sentence.
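
The COR-based filtering can be sketched as follows. The input format is an assumption of this sketch: each element of `occurrences` stands for one occurrence of a feature word in a subsentence, together with the opinion word candidates found in that subsentence. The function name `cor_scores` and the threshold 0.5 are illustrative only.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (feature word, opinion word)


def cor_scores(occurrences: List[Tuple[str, List[str]]]) -> Dict[Pair, float]:
    """Compute COR(f, o) = Freq(f, o) / Freq(f) for every observed pair."""
    feat_freq: Counter = Counter()
    pair_freq: Counter = Counter()
    for feature, opinions in occurrences:
        feat_freq[feature] += 1
        for o in set(opinions):
            pair_freq[(feature, o)] += 1
    return {pair: c / feat_freq[pair[0]] for pair, c in pair_freq.items()}


occurrences = [
    ("service", ["excellent"]),
    ("service", ["excellent", "quick"]),
    ("service", []),
    ("phone quality", ["perfect"]),
]
scores = cor_scores(occurrences)
threshold = 0.5  # set experimentally in practice; arbitrary here
kept = sorted(pair for pair, score in scores.items() if score > threshold)
print(kept)  # [('phone quality', 'perfect'), ('service', 'excellent')]
```

The pair (service, quick) has a COR of 1/3 and falls below the threshold, so it does not become a lexicon entry in this toy example.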


4.2.2 Constraints on Sentiment Polarity Labelling. 

In this step, we attempt to assign a unique sentiment polarity (positive or negative) to each of the feature-opinion pairs in the sentiment lexicon, using a unified convex optimization framework. The framework attempts to learn the optimal assignment of sentiment polarities by searching for the minimum of a loss function, where each term of the loss function represents the intuition behind a specific piece of evidence from a specific information source. Besides, the loss function is expected to be convex so that we can design a fast and simple optimization algorithm to find the unique global optimal solution to the problem.

More formally, suppose the sentiment matrix of the $n$ extracted feature-opinion pairs is represented by $X \in \mathbb{R}^{n \times r}$, where $r$ is the number of sentiment polarity labels used; then we attempt to learn an optimal $X$ and calculate the overall sentiment polarity of each pair by the function $s(X)$. The objective function of the learning process consists of the following constraints based on different information sources.

Constraint on Review-level Sentiment Orientation 

Although an online user review might be either positive or negative in terms of overall sentiment polarity, it does not necessarily mean that users only discuss positive or negative features in a single review. A positive review about a product does not mean that the user has positive opinions on all aspects of the product. Likewise, a negative review does not mean that the user dislikes everything. As a result, the overall sentiment orientation of a review is the comprehensive effect of all the feature-opinion pairs contained in the review text.

Suppose we have $m$ user reviews in the review corpus $T$; then the review-level sentiment matrix $\tilde{X}$ given by the review-level sentiment classification algorithm is an $m \times r$ matrix, where each row is the sentiment vector of the corresponding review.

We construct a matrix $A$ of dimension $m \times n$ to indicate the frequency of each feature-opinion pair occurring in each review. Each row of the matrix represents a review, and each column represents a feature-opinion pair. The element for review $i$ and pair $j$ is defined as:

$$A_{ij} = I^{neg}_{ij} \cdot \frac{\mathrm{Freq}(i,j)}{\sum_k \mathrm{Freq}(i,k)} \tag{4}$$

where $\mathrm{Freq}(i,j)$ is the frequency of feature-opinion pair $j$ in review $i$, which is 0 if review $i$ does not contain pair $j$. Therefore, $\sum_k \mathrm{Freq}(i,k)$ represents the total number of pairs contained in review $i$. The matrix $I^{neg}$ is an indicator matrix that allows us to take the "negation rules" into consideration: $I^{neg}_{ij} = -1$ if the feature-opinion pair $j$ is modified by a negation word in review $i$, e.g., "no", "not", "hardly", etc. Otherwise, $I^{neg}_{ij} = 1$.
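
To make the construction of $A$ concrete, the following sketch builds it from per-review match information. The input format is an assumption of this sketch: each review is a dictionary mapping a pair index $j$ to a tuple of its frequency and a flag saying whether the pair is negated in that review. The function name `build_A` is illustrative only.

```python
from typing import Dict, List, Tuple

# For one review: pair index j -> (Freq(i, j), negated in review i?)
Matches = Dict[int, Tuple[int, bool]]


def build_A(reviews: List[Matches], n_pairs: int) -> List[List[float]]:
    """Build the m x n matrix with A_ij = I_ij^neg * Freq(i, j) / sum_k Freq(i, k)."""
    A = []
    for matches in reviews:
        total = sum(f for f, _ in matches.values())
        row = [0.0] * n_pairs  # pairs absent from the review stay at 0
        for j, (f, negated) in matches.items():
            sign = -1.0 if negated else 1.0
            row[j] = sign * f / total
        A.append(row)
    return A


reviews = [
    {0: (2, False), 1: (1, True), 2: (1, False)},  # pair 1 is negated here
    {1: (1, False)},
    {},  # a review that contains no extracted pair
]
A = build_A(reviews, n_pairs=3)
print(A)  # [[0.5, -0.25, 0.25], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```

The absolute values in each non-empty row sum to 1, so a row acts as a set of signed averaging weights over the pairs in that review.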

According to our assumption that the overall sentiment of a text review is the 
combined effect of all the feature-opinion pairs it contains, we use a sentiment 
prediction function f(A, X) to estimate the sentiment orientations of the reviews, based 
on the review-pair relationship matrix A and our contextual sentiment lexicon X. We 
expect to minimize the difference between our estimations and the review-level 
sentiment matrix $\tilde{X}$ given by the review-level sentiment classification algorithm, 
which leads to the following objective function: 

$$\mathcal{R}_1 = \| f(A, X) - \tilde{X} \|_F^2 \tag{5.1}$$

In this work, we choose a simple but frequently used sentiment prediction function, 
which predicts the overall sentiment orientation of a review as the weighted average 
of all the feature-opinion pairs contained in it. This gives us the following objective 
function: 

$$\mathcal{R}_1 = \| AX - \tilde{X} \|_F^2 \tag{5.2}$$

As negation words are represented by negative weights in A, multiplying 
A and X naturally incorporates the negation rule in (5.2). 
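
As an illustration of the review-pair matrix and the review-level objective, the following minimal Python sketch builds rows of A from pair occurrences and evaluates the squared Frobenius distance between AX and the review-level sentiment matrix. The function names, the toy data, and the rule that a pair counts as negated in a review if any of its occurrences is negated are assumptions of this sketch, not details taken from the paper.

```python
from collections import Counter


def build_row(occurrences, n_pairs):
    """Build one row of A from (pair_index, is_negated) occurrences in a review."""
    freq = Counter(j for j, _ in occurrences)
    negated = {j for j, is_neg in occurrences if is_neg}
    total = sum(freq.values())  # total number of pairs in the review
    row = [0.0] * n_pairs
    for j, count in freq.items():
        sign = -1 if j in negated else 1  # plays the role of I^neg_ij
        row[j] = sign * count / total
    return row


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def squared_frobenius_distance(a, b):
    """Sum of squared entrywise differences between two matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy data (hypothetical): 2 reviews, 3 feature-opinion pairs.
reviews = [
    [(0, False), (0, False), (1, True)],  # pair 0 twice; pair 1 once, negated
    [(2, False)],
]
A = [build_row(r, n_pairs=3) for r in reviews]
X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # contextual lexicon, one row per pair
X_review = [[1.0, 0.0], [0.0, 1.0]]        # review-level sentiment, one row per review
r1 = squared_frobenius_distance(matmul(A, X), X_review)
```

In this toy case the first row of A is [2/3, -1/3, 0], so the negated pair pulls the predicted review sentiment away from the positive class.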

Constraint on General Sentiment Lexicon 

As stated in the definition of the sentiment lexicon, some opinion words such as 
excellent, good and bad have "fixed" polarities regardless of the feature word they 
accompany. Therefore, we construct the general sentiment lexicon $X_0$ by labelling the 
polarities of the feature-opinion pairs in X directly according to publicly available 
sentiment corpora. 

We first construct a positive opinion word set and a negative opinion word set for 
English and Chinese reviews, respectively. The word sets for English are constructed 
from the MPQA opinion corpus (footnote 2), which contains 2718 positive words and 4902 
negative words, and the word sets for Chinese are constructed from HowNet (footnote 3), 
with 3730 positive words and 3116 negative words. For the i-th feature-opinion pair 
(f, o), the corresponding sentiment vector $x_i$ in $X_0$ is: 

$$x_i = \begin{cases} [1, 0]^T, & \text{if } o \text{ is in the positive word set} \\ [0, 1]^T, & \text{if } o \text{ is in the negative word set} \\ [0, 0]^T, & \text{otherwise} \end{cases} \tag{6}$$

and finally, $X_0 = [x_1 \, x_2 \cdots x_n]^T$ serves as the general sentiment lexicon. 

2 http://mpqa.cs.pitt.edu/corpora/ 

3 http://www.keenage.com/ 

We expect the sentiment polarities of the fixed opinion words learnt in the contextual 
sentiment lexicon X to be close to those in the general sentiment lexicon $X_0$, which 
leads to the following objective function: 

$$\mathcal{R}_2 = \| G (X - X_0) \|_F^2 \tag{7}$$

where G is a diagonal matrix that indicates which feature-opinion pairs in X are "fixed" 
by the general sentiment lexicon $X_0$. Namely, $G_{ii} = 1$ if the i-th feature-opinion 
pair has fixed sentiment, and $G_{ii} = 0$ otherwise. 
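
The construction of the general lexicon and of the diagonal of G can be sketched as follows. This is a minimal Python illustration with hypothetical word sets and pairs; it assumes that a pair counts as "fixed" exactly when its opinion word appears in one of the two word sets, which is one reading of the definition above.

```python
def general_lexicon(pairs, positive_words, negative_words):
    """Return (X0, diagonal of G) for a list of (feature, opinion) pairs."""
    x0, g_diag = [], []
    for _feature, opinion in pairs:
        if opinion in positive_words:
            x0.append([1, 0])
            g_diag.append(1)
        elif opinion in negative_words:
            x0.append([0, 1])
            g_diag.append(1)
        else:
            x0.append([0, 0])   # polarity unknown in the general lexicon
            g_diag.append(0)    # assumption: such pairs are not "fixed"
    return x0, g_diag


def general_lexicon_penalty(x, x0, g_diag):
    """Squared Frobenius norm of G (X - X0) for a diagonal 0/1 matrix G."""
    return sum(g * (a - b) ** 2
               for row, row0, g in zip(x, x0, g_diag)
               for a, b in zip(row, row0))


# Hypothetical word sets and pairs, for illustration only.
pairs = [("screen", "good"), ("battery", "bad"), ("price", "high")]
X0, g_diag = general_lexicon(pairs, {"good", "excellent"}, {"bad"})
penalty = general_lexicon_penalty([[0.8, 0.2], [0.1, 0.9], [0.3, 0.7]], X0, g_diag)
```

The pair (price, high) contributes nothing to the penalty, since its polarity depends on context and is not constrained by the general lexicon.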


Constraint on Linguistic Heuristics 

An important and frequently adopted type of linguistic heuristic is based on the 
conjunctions in user reviews (Liu 2010; Hu et al. 2013). Intuitively, feature-opinion 
pairs i and j that are frequently joined by "and" in the corpus are likely to have similar 
sentiments, while those that are frequently connected by words like "but" tend to have 
opposite sentiments. For example, in the sentence "the phone quality is perfect and the 
sound effect is clear", if "perfect" is known to be positive, then it can be inferred that 
"clear" is also positive. 

To formalize the intuition, we define two $n \times n$ matrices $W^a$ and $W^b$ for the 
"and" and "but" linguistic heuristics, respectively, where $W^a_{ij}, W^b_{ij} \in [0, 1]$ 
indicate our confidence that pairs i and j have the same or opposite sentiments. A simple 
but frequently used choice is to set $W^a_{ij} = W^a_{ji} = 1$ if pairs i and j are joined by 
"and" at least a minimum number of times across all the subsentences in the corpus, and 
$W^a_{ij} = W^a_{ji} = 0$ otherwise. Similarly, $W^b_{ij} = W^b_{ji} = 1$ if pairs i and j are 
linked by "but" at least a minimum number of times in the corpus, and 0 otherwise. 

To incorporate the "and" heuristic in our model, we propose to optimize the 
following objective function: 

$$\mathcal{R}_3^a = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \| X_{i*} - X_{j*} \|_2^2 \, W^a_{ij} = \mathrm{Tr}(X^T D^a X) - \mathrm{Tr}(X^T W^a X) \tag{8}$$

where $\mathrm{Tr}(\cdot)$ is the trace of a matrix, and $X_{i*}$ represents the i-th row of X, 
which is also the sentiment vector for pair i. $D^a \in \mathbb{R}^{n \times n}$ is a diagonal 
matrix where $D^a_{ii} = \sum_{j=1}^{n} W^a_{ij}$. The underlying intuition in (8) is that the 
sentiment vectors of pairs i and j should be similar to each other if they are frequently 
linked by "and"; otherwise, a penalty is introduced to the loss function. 

The formalization of the "but" heuristic is similar, except that we expect the sentiment 
vector for pair i to be close to the "opposite" of the sentiment vector for pair j. More 
intuitively, if $X_{i*}$ has a high score in its first dimension, which implies that pair i 
tends to be positive, then $X_{j*}$ should have a high score in its second dimension, 
which drives pair j to be negative, and vice versa. In order to model this intuition, we 
introduce the following optimization term: 

$$\mathcal{R}_3^b = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \| X_{i*} - X_{j*} E \|_2^2 \, W^b_{ij} = \mathrm{Tr}(X^T D^b X) - \mathrm{Tr}(X^T W^b X E) \tag{9}$$

where $E = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ is an anti-diagonal matrix that serves as 
a column permutation to reverse the columns of X. Similarly, $D^b$ is a diagonal matrix 
where $D^b_{ii} = \sum_{j=1}^{n} W^b_{ij}$. 

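Finally, the objective function regarding both the "and" and "but" linguistic heuristics is: 

$$\begin{aligned} \mathcal{R}_3 = \mathcal{R}_3^a + \mathcal{R}_3^b &= \mathrm{Tr}(X^T D^a X) - \mathrm{Tr}(X^T W^a X) + \mathrm{Tr}(X^T D^b X) - \mathrm{Tr}(X^T W^b X E) \\ &= \mathrm{Tr}(X^T D X) - \mathrm{Tr}(X^T W^a X) - \mathrm{Tr}(X^T W^b X E) \end{aligned} \tag{10}$$

where $D = D^a + D^b$. 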
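
The equivalence between the pairwise-sum form and the trace form of the "and"/"but" penalty can be checked numerically. The following minimal Python sketch evaluates both forms entrywise on a hypothetical three-pair example; the function names and the toy weights are assumptions of this sketch, and the two weight matrices are taken to be symmetric, as in the construction described above.

```python
def penalty_pairwise(x, w_and, w_but):
    """Pairwise form: 1/2 * sum_ij W_ij * squared distance between row i and its target."""
    total = 0.0
    n = len(x)
    for i in range(n):
        for j in range(n):
            same = sum((a - b) ** 2 for a, b in zip(x[i], x[j]))
            # Right-multiplying row j by E swaps its two entries.
            opposite = sum((a - b) ** 2 for a, b in zip(x[i], reversed(x[j])))
            total += 0.5 * (w_and[i][j] * same + w_but[i][j] * opposite)
    return total


def penalty_trace(x, w_and, w_but):
    """Trace form Tr(X^T D X) - Tr(X^T W^a X) - Tr(X^T W^b X E), written entrywise."""
    total = 0.0
    n = len(x)
    for i in range(n):
        degree = sum(w_and[i]) + sum(w_but[i])        # D_ii = D^a_ii + D^b_ii
        total += degree * sum(a * a for a in x[i])    # contribution to Tr(X^T D X)
        for j in range(n):
            dot_same = sum(a * b for a, b in zip(x[i], x[j]))
            dot_swap = sum(a * b for a, b in zip(x[i], reversed(x[j])))
            total -= w_and[i][j] * dot_same           # contribution to Tr(X^T W^a X)
            total -= w_but[i][j] * dot_swap           # contribution to Tr(X^T W^b X E)
    return total


# Hypothetical example: pairs 0 and 1 joined by "and", pairs 1 and 2 joined by "but".
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7]]
W_and = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
W_but = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]
pairwise_value = penalty_pairwise(X, W_and, W_but)
trace_value = penalty_trace(X, W_and, W_but)
```

Both forms give the same small penalty here, because the "and" pairs already have similar vectors and the "but" pairs already have nearly opposite ones.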

Constraint on Sentential Sentiment Consistency 

The use of linguistic heuristics is extended in (Kanayama and Nasukawa 2006) 
by introducing sentential sentiment consistency (called coherency in (Kanayama and 
Nasukawa 2006)). The fundamental assumption of sentential sentiment consistency is 
that the same opinion orientation (positive or negative) is usually expressed across a few 
consecutive sentences, which is reported to help improve the accuracy of 
contextual sentiment polarity labelling. 

To formalize the heuristic, a sentential similarity matrix $W^s \in \mathbb{R}^{n \times n}$ 
is introduced, which leverages the sentential distance between feature-opinion pairs in 
the corpus to estimate their sentential similarities. For example, consider two pairs i 
and j: if they co-occur in the same review in the corpus, we calculate their sentential 
similarity in this review, and the final similarity between i and j is the average of all 
the intra-review similarities of their co-occurrences. More formally, suppose pair i and 
pair j co-occur in the same review $N_{ij}$ times, and the k-th co-occurrence happens in 
review $r_{ik}$; then $W^s_{ij}$ and $W^s_{ji}$ are defined as: 

$$W^s_{ij} = W^s_{ji} = \begin{cases} 0, & \text{if } N_{ij} = 0 \text{ or } W^a_{ij} \neq 0 \text{ or } W^b_{ij} \neq 0 \\ \ldots, & \text{otherwise} \end{cases} \tag{11}$$

where the length of a review $length(r_{ik})$ is the number of words (punctuation 
excluded) in the review, and the distance of pairs i and j in the review, $dist(i, j)$, is the 
number of words between the feature words of the two pairs. Note that we do not 
consider two pairs if they have already been constrained by the "and" ($W^a_{ij} \neq 0$) or 
"but" ($W^b_{ij} \neq 0$) linguistic heuristic. Besides, two pairs might co-occur more than 
once in the same review, and we consider all the pairwise combinations in such cases. 

Once the sentential similarity matrix is constructed, we incorporate the sentential 
sentiment similarity constraint by taking into account the following objective function: 

$$\mathcal{R}_4 = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \| X_{i*} - X_{j*} \|_2^2 \, W^s_{ij} = \mathrm{Tr}(X^T D^s X) - \mathrm{Tr}(X^T W^s X) \tag{12}$$

where $D^s$ is also a diagonal matrix, and $D^s_{ii} = \sum_{j=1}^{n} W^s_{ij}$. 

The underlying intuition of this objective function is that a large penalty is 
introduced if the sentiment vectors of two nearby pairs differ significantly. 

Algorithm 1 Contextual Sentiment Polarity Labelling 

Require: $A, \tilde{X}, X_0, \lambda_1, \lambda_2, \lambda_3, \lambda_4, N, \delta$ 
Ensure: $X$ 
 1: Construct matrix $G$ in Equation (7) 
 2: Construct matrices $D$, $W^a$ and $W^b$ in Equation (10) 
 3: Construct matrices $D^s$ and $W^s$ in Equation (12) 
 4: Initialize $X \leftarrow X_0$, $X' \leftarrow X_0$, $n \leftarrow 0$ 
 5: repeat 
 6:     $n \leftarrow n + 1$ 
 7:     $X' \leftarrow X$ 
 8:     for each element $X_{ij}$ in $X$ do 
 9:         $X_{ij} \leftarrow X_{ij} \sqrt{\dfrac{[\lambda_1 A^T \tilde{X} + \lambda_2 G X_0 + \lambda_3 W^a X + \lambda_3 W^b X E + \lambda_4 W^s X]_{ij}}{[\lambda_1 A^T A X + \lambda_2 G X + \lambda_3 D X + \lambda_4 D^s X]_{ij}}}$ 
10:     end for 
11: until $\| X - X' \|_F^2 < \delta$ or $n > N$ 
12: return $X$ 

4.2.3 The Unified Model for Polarity Labelling. 

With the above constraints from different information sources and aspects, we have the 
following objective function for learning the contextual sentiment lexicon X: 

$$\begin{aligned} \min \mathcal{R} = {} & \lambda_1 \| AX - \tilde{X} \|_F^2 + \lambda_2 \| G(X - X_0) \|_F^2 \\ & + \lambda_3 \left( \mathrm{Tr}(X^T D X) - \mathrm{Tr}(X^T W^a X) - \mathrm{Tr}(X^T W^b X E) \right) \\ & + \lambda_4 \left( \mathrm{Tr}(X^T D^s X) - \mathrm{Tr}(X^T W^s X) \right) \end{aligned} \tag{13}$$

where $\tilde{X}$ is the review-level sentiment matrix, and $\lambda_1$, $\lambda_2$, $\lambda_3$ 
and $\lambda_4$ are positive weighting parameters that control the contribution 
of each information source in the learning process. 

An important property of the objective function (13) is its convexity, which makes it 
possible to search for the global optimal solution $X^*$ to the contextual sentiment polarity 
labelling problem. We give the updating rule for learning $X^*$ directly here, as shown 
in (14); the proof of the updating rule and of its convergence is given in the 
appendix. 

$$X_{ij} \leftarrow X_{ij} \sqrt{\frac{[\lambda_1 A^T \tilde{X} + \lambda_2 G X_0 + \lambda_3 W^a X + \lambda_3 W^b X E + \lambda_4 W^s X]_{ij}}{[\lambda_1 A^T A X + \lambda_2 G X + \lambda_3 D X + \lambda_4 D^s X]_{ij}}} \tag{14}$$


The algorithm for learning the contextual sentiment lexicon is shown in Algorithm 1. 
In this algorithm, we first initialize the indicator matrices, Laplacian matrices and 
sentiment matrices in lines 1 to 4. The predefined parameter N is the maximum number 
of iterations to conduct. The contextual sentiment lexicon X is updated 
repeatedly until convergence or until the maximum number of iterations is reached, where 
convergence means that the entrywise $\ell_2$-norm (Frobenius norm) of the difference 
matrix between two consecutive iterations is less than a predefined residual error $\delta$. 
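
The iterative procedure can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the update is read as a multiplicative rule with a square root, the helper names and the toy data are hypothetical, and three safeguards are added that are not part of the source (a small constant `EPS` in the denominator, clamping a negative numerator to zero, and an `init_floor` that lifts zero entries of the initial lexicon, because a purely multiplicative update can never move an entry that starts at exactly zero).

```python
import math

EPS = 1e-12  # assumption of this sketch: guards against division by zero


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def weighted_sum(terms):
    """Entrywise sum of (weight, matrix) pairs of equal shape."""
    first = terms[0][1]
    return [[sum(w * m[i][j] for w, m in terms) for j in range(len(first[0]))]
            for i in range(len(first))]


def degree_matrix(w):
    """Diagonal matrix whose i-th entry is the i-th row sum of w."""
    return [[sum(w[i]) if i == j else 0.0 for j in range(len(w))] for i in range(len(w))]


def swap_columns(x):
    """Right-multiplication by E = [[0, 1], [1, 0]]."""
    return [[row[1], row[0]] for row in x]


def learn_lexicon(A, X_review, X0, G, W_and, W_but, W_sent,
                  lambdas=(1.0, 1.0, 1.0, 1.0), max_iter=100, delta=0.01,
                  init_floor=0.1):
    l1, l2, l3, l4 = lambdas
    At = [list(col) for col in zip(*A)]
    AtA, AtXr, GX0 = matmul(At, A), matmul(At, X_review), matmul(G, X0)
    D = weighted_sum([(1.0, degree_matrix(W_and)), (1.0, degree_matrix(W_but))])
    Ds = degree_matrix(W_sent)
    # Deviation from Algorithm 1: lift zero entries so the update can move them.
    X = [[max(v, init_floor) for v in row] for row in X0]
    for _ in range(max_iter):
        prev = X
        num = weighted_sum([(l1, AtXr), (l2, GX0), (l3, matmul(W_and, X)),
                            (l3, swap_columns(matmul(W_but, X))), (l4, matmul(W_sent, X))])
        den = weighted_sum([(l1, matmul(AtA, X)), (l2, matmul(G, X)),
                            (l3, matmul(D, X)), (l4, matmul(Ds, X))])
        X = [[x * math.sqrt(max(p, 0.0) / max(q, EPS)) for x, p, q in zip(rx, rp, rq)]
             for rx, rp, rq in zip(prev, num, den)]
        change = sum((a - b) ** 2 for ra, rb in zip(X, prev) for a, b in zip(ra, rb))
        if change < delta:  # squared Frobenius norm of the difference
            break
    return X


# Toy run (hypothetical data): one positive review mentioning two pairs equally;
# pair 0 is fixed as positive by the general lexicon, pair 1 is unknown.
zeros = [[0.0, 0.0], [0.0, 0.0]]
X = learn_lexicon(A=[[0.5, 0.5]], X_review=[[1.0, 0.0]], X0=[[1.0, 0.0], [0.0, 0.0]],
                  G=[[1.0, 0.0], [0.0, 0.0]], W_and=zeros, W_but=zeros, W_sent=zeros)
```

In the toy run, the unknown pair inherits a positive orientation from the positive review it appears in, which is the boosting effect the framework relies on.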

4.2.4 Overall Sentiment Polarity of the Pairs. 

After the contextual sentiment lexicon X is constructed, we use the predefined function 
$s(X) = [s(x_1) \, s(x_2) \cdots s(x_n)]^T$ to determine the sentiment polarities of the 
feature-opinion pairs. In this work, we choose the function $s(x_i) = x_{i1} - x_{i2}$, where 
$x_{i1}$ and $x_{i2}$ are the values of the positive and negative polarity labels in sentiment 
vector $x_i$, respectively. Pair i is labeled as positive if $s(x_i) > 0$, and negative if 
$s(x_i) < 0$. For simplicity, we leave out the neutral polarity label, like most existing 
work, as it is quite rare that $x_{i1} = x_{i2}$. 
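
The decision rule above amounts to a sign test on the difference of the two components. A minimal Python sketch follows; the function name and the choice to return `None` for the rare tie are assumptions of this sketch.

```python
def polarity(sentiment_vector):
    """Label a pair from its [positive, negative] sentiment vector."""
    score = sentiment_vector[0] - sentiment_vector[1]  # s(x_i) = x_i1 - x_i2
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return None  # tie: the neutral label is left out


labels = [polarity(x) for x in [[0.9, 0.1], [0.2, 0.6], [0.5, 0.5]]]
```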


5. Experiments 

In this section, we conduct extensive experiments to evaluate the proposed framework, 
and investigate the effect of different parameter settings in our framework. We will 
attempt to answer the following two research questions: 

1. Are the numerical star ratings always consistent with the overall sentiment 
orientations of textual user reviews? 

2. How effective is our proposed framework compared with other polarity 
labelling methods? 

We begin by introducing the experimental settings, and then investigate the 
relationship between numerical ratings and text reviews. Finally, we evaluate the proposed 
framework and compare it with other techniques. 

5.1 Experimental Setup 

For the experiments on English, we use the MP3 player reviews crawled from 
Amazon, which are publicly available (footnote 4). For the experiments on Chinese, we 
use restaurant reviews crawled from DianPing.com (footnote 5), a well-known restaurant 
rating website in China, and we have also made this dataset publicly available (footnote 6). 
Each review in the two datasets consists of a piece of review text and an overall numerical 
rating ranging from 1 to 5 stars. We choose one English and one Chinese dataset because 
the two languages are linguistically quite different, and we want to examine whether our 
framework works in different language environments. Some statistics of the two datasets 
are shown in Table 1. 


Table 1 

Some statistics of the two datasets. 

Dataset       Language   #Users    #Items    #Reviews 
MP3 Player    English    26,113    796       55,740 
Restaurant    Chinese    11,857    89,462    510,551 


4 http://sifaka.cs.uiuc.edu/~wang296/Data/ 

5 http://www.dianping.com/ 

6 http://tv.thuir.org/data/ 

An important property of our restaurant review dataset is that each review is 
accompanied by three sub-aspect ratings in addition to the overall rating. They are the 
users' ratings on the flavour, environment and service of restaurants, respectively. A user 
is required to rate all three aspects as well as the overall experience 
when writing a review on the website, which makes it possible for us to conduct a more 
detailed user rating analysis on this dataset. The sub-aspect ratings also range 
from 1 to 5. 


5.2 User Rating Analysis 


Previous work (Cui, Mittal, and Datar 2006; Lu et al. 2011; Wang, Lu, and Zhai 2010) 
labels a review text as positive if the corresponding overall rating is 4 or 5 stars, 
and negative if the overall rating is 1, 2 or 3 stars. However, the overall rating might 
not always be consistent with the sentiment orientation of the review text. According to 
our observation, a substantial number of users tend to give the same overall rating 
even though the sentiment orientations expressed in their review texts might be quite 
different. Most interestingly, many users simply give 4-star ratings regardless of the 
review text they wrote. We analyze this phenomenon in the rest of this section. 


5.2.1 Overall and Sub-Aspect Ratings. 

We begin by analyzing the difference in the rating distributions of the restaurant dataset. 
The ratings on the three sub-aspects allow us to investigate a user's "true" feelings on more 
specific aspects of the restaurant beyond the overall rating. For the overall rating and 
each of the sub-aspect ratings, we calculate the percentage that each of the 5 star ratings 
takes in the total number of ratings, as shown in Figure 2. The x-axis represents 1 star 
through 5 stars, and the y-axis is the percentage of each kind of star rating. 



Figure 2 

The percentage of each of the five star ratings (1 star through 5 stars) against the total number of 
ratings, in terms of the overall rating, as well as the three kinds of sub-aspect ratings flavour, 
environment and service. 


We see that user ratings tend to center around 4 stars on the overall rating, while they 
tend to center around 2 to 3 stars on the sub-aspect ratings. This implies that the overall 
rating might not be a true reflection of users' feelings, and that users tend to "tell the 
truth" in the more detailed sub-aspects. To quantify this difference, we 
calculate the average rating $\mu$ and the coefficient of variation $c_v = \sigma / \mu$ for the 
overall rating and the three kinds of sub-aspect ratings, where $\sigma$ is the standard 
deviation. Table 2 shows the results. We see that users tend to give higher scores on the 
overall rating, and the scores on the overall rating are more concentrated. 


Table 2 

The average ratings and the coefficients of variation of the overall rating and sub-aspect ratings. 

          Overall   Flavour   Environment   Service 
$\mu$     3.6432    3.1547    2.8934        2.8510 
$c_v$     0.1977    0.2522    0.2697        0.2816 
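
For reference, the two statistics reported in Table 2 can be computed as in the following minimal Python sketch. The ratings are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the population or sample form was used.

```python
import statistics


def rating_summary(ratings):
    """Return (mean, coefficient of variation) for a list of star ratings."""
    mu = statistics.mean(ratings)
    sigma = statistics.pstdev(ratings)  # assumption: population standard deviation
    return mu, sigma / mu


mu, cv = rating_summary([4, 4, 5, 3, 4])  # hypothetical ratings
```

A smaller coefficient of variation means the ratings are more tightly concentrated around their mean, which is how the overall rating compares with the sub-aspect ratings in Table 2.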


For a more intuitive view, we conduct a per-user analysis. For each user and each kind of 
rating (overall, flavour, environment and service), we calculate the percentage of 4+ 
star ratings (4 and 5 stars) that the user gave. We then sort these per-user percentages in 
descending order, as shown in Figure 3. 



Figure 3 

The percentage of 4+ stars made by each of the users (x-axis: sorted user IDs). The points 
are sorted in descending order so that the quantiles can be identified more easily. 

It is clear that user rating behaviours differ between overall and sub-aspect ratings. 
More than half of the users gave 4+ stars in 50% or more of their overall ratings, 
while fewer than 5% of users did so on sub-aspect ratings. 

This analysis partly shows that it might not be appropriate to use overall ratings 
as ground truth to label the sentiment orientations of review texts, as users tend to act 
differently when giving overall ratings and when expressing their true feelings on detailed 
product aspects or features. 

5.2.2 Labelling Accuracy using Different Methods. 

In fact, users might consider many features other than flavour, environment and 
service when giving the overall rating. One may therefore argue that the difference in 
rating distributions across aspects is insufficient to conclude that the overall rating 
is inappropriate for estimating the overall sentiment orientation of review texts. For this 
reason, we evaluate the accuracy of labelling the sentiment orientation of review texts 
using different kinds of numerical ratings. 

We randomly sampled 1000 reviews out of the 510,551 reviews in the restaurant 
review dataset to be labeled manually by 3 annotators. Each annotator is asked to give a 
single label, positive or negative, to each review text. The final sentiment of a 
review is the majority label (the label given by at least two of the three annotators). 
The inter-annotator agreement is 79.76%, which is comparable to the figures reported in 
existing work on sentiment analysis (Pang and Lee 2008; Lu et al. 2011). The annotated 
reviews consist of 673 (67.3%) positive reviews and 327 (32.7%) negative reviews. We 
use four methods to estimate the overall sentiment orientation of the review texts 
automatically: 


1. Overall Rating: A review text is labeled as positive if the overall star 
rating is ≥ 4, and negative otherwise. The criterion of 4 stars is chosen 
in accordance with previous work (Lu et al. 2011; Lu, Zhai, and 
Sundaresan 2009) for easier comparison. 

2. Normalized Overall Rating: We use $r' = r - \mu_i$ for labelling, where r is 
the overall rating, and $\mu_i$ is the average rating of the corresponding user i. 
A review is labeled as positive if $r' > 0$ and negative if $r' < 0$. 

3. Sub-Aspect Rating: We use $\bar{r} = (r_f + r_e + r_s)/3$ for labelling, where 
$r_f$, $r_e$ and $r_s$ are the ratings on flavour, environment and service, respectively. 
The criterion of 4 stars is also used here. 

4. Sentiment Classification: We use the unsupervised review-level sentiment 
classification method described in Section 4 for orientation labelling. 
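
The three rating-based labelling rules in the list above can be sketched in Python as follows. The function names are hypothetical, and returning `None` when the normalized rating is exactly zero is an assumption, since the text does not say how that case is handled.

```python
def label_by_overall(rating, threshold=4):
    """Method 1: positive if the overall rating reaches the 4-star criterion."""
    return "positive" if rating >= threshold else "negative"


def label_by_normalized(rating, user_ratings):
    """Method 2: compare the rating with the user's own average rating."""
    shifted = rating - sum(user_ratings) / len(user_ratings)
    if shifted > 0:
        return "positive"
    if shifted < 0:
        return "negative"
    return None  # assumption: a rating equal to the user's average is left unlabeled


def label_by_subaspects(flavour, environment, service, threshold=4):
    """Method 3: apply the 4-star criterion to the mean sub-aspect rating."""
    mean_rating = (flavour + environment + service) / 3
    return "positive" if mean_rating >= threshold else "negative"


# A user who only ever gives 4 and 5 stars (hypothetical rating history).
history = [4, 5]
normalized_labels = [label_by_normalized(r, history) for r in history]
```

For this user, normalization splits the two reviews into one negative and one positive label, even though a 4-star and a 5-star review may both read as positive.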

We use precision to evaluate the performance of each method, with our human 
annotations as the golden standard. The results are shown in Table 3, where "Pos.Reviews" 
and "Neg.Reviews" represent the precisions of labelling positive and negative reviews, 
respectively, and "Overall" is the overall performance of review-level orientation 
labelling. 


Table 3 

The precisions of review-level sentiment orientation labelling using different methods. 

               1-Overall   2-Normalize   3-Subaspect   4-Classify 
Pos.Reviews    0.8321      0.5438        0.8009        0.9064 
Neg.Reviews    0.7248      0.7859        0.7951        0.8563 
Overall        0.7970      0.6230        0.7990        0.8900 


We see that orientation labelling using sub-aspect ratings (3-Subaspect) slightly 
outperforms labelling using overall ratings directly (1-Overall) in overall precision. 
However, both perform worse than the sentiment classification method (4-Classify). 

Intuitively, we would expect user-based normalization to perform better, as 
different users might have different rating scales. However, the experimental 
results show that user-based normalization (2-Normalize) gives the worst performance. 
This is because many users consistently give relatively high overall ratings, 
regardless of the ratings they give on the sub-aspects. Suppose that a user 
made two ratings, one of 4 stars and the other of 5 stars. The ratings would 
be normalized to -0.5 and 0.5, which leads to one negative label and one positive label. 
However, the 4-star review text and the 5-star review text might both be positive in the 
golden standard. 

In the remaining experiments, we primarily use the classification method 
to label the review-level sentiment orientations, and the result is further used to 
supervise the process of contextual sentiment lexicon construction. In addition, we use the 
results given by overall rating labelling and sub-aspect rating labelling for performance 
comparison. 

5.3 Feature-Opinion Pair Extraction 

There are three experimentally set parameters in the process of feature-opinion pair 
extraction: 1) the minimum term frequency (denoted by freq) of noun phrases 
to be selected as candidate feature words, 2) the minimum Pointwise Mutual 
Information (denoted by pmi) of a feature word candidate to be retained during the 
filtering process, and 3) the minimum Co-Occurrence Ratio (denoted by cor) of a pair to be 
selected as a final feature-opinion pair. 

Pool Lexicon: In principle, we used relatively strict parameters in order to obtain high 
quality feature-opinion pairs, which allows us to focus on the core task of phrase-level 
polarity labelling in this work. After careful parameter selection, we set freq = 10, 
pmi = 0.005, cor = 0.05 on the mp3 player dataset, which leaves us with 1063 pairs, 
and freq = 20, pmi = 0.01, cor = 0.05 on the restaurant review dataset, which gives 
us 1329 pairs. These pairs are presented to the three annotators for polarity labelling 
(positive or negative), and the final polarity of a pair is assigned according to the 
majority of the labels. The average agreement among annotators in this task is 81.84%. 
The pool lexicon is used for evaluating the precision of polarity labelling. 

Golden Standard Lexicon: We then present the feature-opinion pair lists to human 
annotators to construct the golden standard lexicon. In this stage, each annotator is 
asked to select the feature-opinion pairs that describe an explicit aspect of an mp3 player 
or a restaurant. A pair is retained if it is selected by at least two annotators among the 
three, and the average agreement among annotators is 78.69%. The purpose of this stage 
is to further filter out the low quality pairs in the pool lexicon. For example, the pair 
(service is, good) could be discarded as the right feature word should be service, rather 
than service is, although this pair does express a positive sentiment. The final golden 
standard lexicons for the mp3 player dataset and the restaurant review dataset consist of 
695 and 857 feature-opinion pairs, respectively, and they are used for the evaluation of 
recall. 

We use different lexicons to evaluate precision and recall in order to avoid the problem 
of evaluation bias pointed out in (Lu et al. 2011). 

5.4 Phrase-Level Polarity Labelling 

In this section, we conduct automatic sentiment polarity labelling on the feature-opinion 
pairs in the pool lexicon, and report the evaluation results of our method and the 
methods for comparison. 


5.4.1 Evaluation Measures. 

We choose the frequently used measures precision, recall and F-measure to evaluate the 
performance of polarity labelling, which are defined as follows: 

$$precision = \frac{N_{p\_agree}}{N_{lexicon}}, \qquad recall = \frac{N_{g\_agree}}{N_{gold}}, \qquad F\text{-}measure = \frac{2 \times precision \times recall}{precision + recall}$$

where $N_{lexicon}$ is the number of feature-opinion pairs in the automatically constructed 
sentiment lexicon, and $N_{gold}$ is the number of pairs in the golden standard lexicon 
(695 on the mp3 player dataset and 857 on the restaurant review dataset). $N_{p\_agree}$ and 
$N_{g\_agree}$ are the numbers of pairs labeled consistently with the pool lexicon and the 
golden standard lexicon, respectively. 
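
The three measures can be computed as in the following minimal Python sketch. The function names and the counts passed to `ratio` are hypothetical; the precision and recall passed to `f_measure` are the values reported for the "Boost" method on the MP3 player dataset in Table 4.

```python
def ratio(n_agree, n_total):
    """Precision or recall as the fraction of consistently labeled pairs."""
    return n_agree / n_total


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


example_precision = ratio(850, 1000)  # hypothetical counts, for illustration only
boost_f = f_measure(0.8504, 0.7683)   # precision and recall reported in Table 4
```

Rounded to four decimal places, `boost_f` reproduces the F-measure reported for that row.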

5.4.2 Polarity Labelling Results. 

We adopted the following methods for phrase-level sentiment polarity labelling in the 
experiment: 


General: Make predictions by querying the polarity of the opinion word in the 
general sentiment opinion word sets. As in Section 4.2.2, we use MPQA for English and 
HowNet for Chinese. A pair is discarded if the polarity 
of its opinion word cannot be determined. 

Optimize: The optimization approach proposed in (Lu et al. 2011), which 
reduces the problem of sentiment polarity labelling to constrained linear 
programming. 

Overall: Use our framework, except that the review-level sentiment 
orientation is determined using the corresponding overall rating. 

Subaspect: Use our framework, except that the sentiment orientations of 
reviews are determined by averaging the corresponding sub-aspect 
ratings. 

Boost: Use our complete framework, where sentiment classification on the 
review text is conducted to boost phrase-level sentiment polarity labelling. 


We set the residual error threshold $\delta = 0.01$ and $N = 100$ in Algorithm 1 to ensure 
convergence, and use $\lambda_1 = \lambda_2 = \lambda_3 = \lambda_4 = 1$ in this set of 
experiments. Results on the two datasets using the above five methods are shown in 
Table 4, in which the bolded numbers are the best performance on the corresponding 
measure. We did not run the "Subaspect" method on the 
mp3 player reviews, as sub-aspect ratings are absent from this dataset. 


Table 4 

Performance of sentiment polarity labelling using different methods on the MP3 player dataset 
(English) and restaurant review dataset (Chinese). 

                     Precision   Recall    F-measure 
MP3 Player Data 
  General            0.9238      0.4201    0.5776 
  Optimize           0.8269      0.7626    0.7934 
  Overall            0.8288      0.7525    0.7888 
  Boost              *0.8504     0.7683    0.8073 
Restaurant Review 
  General            0.9017      0.3571    0.5115 
  Optimize           0.8405      0.7760    0.8069 
  Overall            0.8473      0.7468    0.7938 
  Subaspect          0.8675      0.7561    0.8079 
  Boost              *0.8879     0.7818    0.8315 


We see that labelling the polarities by querying the general opinion word sets gives 
the best precision on both datasets. However, the recall of this method is 
rather low. This implies that there are many "context dependent" opinion words which 
are absent from these word sets. 

The "Optimize" method in (Lu et al. 2011) and our "Overall" method are similar 
in that both leverage overall numerical ratings as the ground truth of review-level 
sentiment orientations, and they make use of similar heuristics and constraints. 
Although the "Optimize" method achieves slightly better recall, their overall performances 
are comparable. Furthermore, by taking advantage of the sub-aspect ratings, the 
"Subaspect" method improves precision and F-measure over both the "Optimize" and 
"Overall" methods, and improves recall over the "Overall" method, which suggests that the 
detailed sub-aspect ratings could be more reliable than overall ratings. 

Finally, our "Boost" method achieves the best performance in terms of recall and 
F-measure on both datasets. It also achieves the best precision among all methods 
except the "General" method. This further verifies the effect of leveraging review-level 
sentiment classification in boosting the process of phrase-level polarity labelling. 

5.5 Parameter Analysis 

In the previous sections, we set equal weights for the different kinds of constraints for 
general experimental purposes. In this subsection, we attempt to study the effect of 
different constraints in our framework by analyzing the four main parameters $\lambda_1$ 
to $\lambda_4$ in objective function (13). 

We first conduct a "Knock Out One Term" experiment on these parameters, to see 
whether all these constraints contribute to the performance of phrase-level polarity 
labelling. We set one of the four parameters to 0 at a time, and evaluate the F-measure. 
The results are shown in Table 5. 


Table 5 

Evaluation results on F-measure by knocking out one constraint, where the knocked out 
constraint is represented by 0. 

                      $\lambda_1$   $\lambda_2$   $\lambda_3$   $\lambda_4$   MP3 Player   Restaurant 
Default               1             1             1             1             0.8073       0.8315 
Knock out one term    0             1             1             1             0.6783       0.6476 
                      1             0             1             1             0.6332       0.6728 
                      1             1             0             1             0.7461       0.7352 
                      1             1             1             0             0.7756       0.7504 


The experimental results show that knocking out any of the four terms 
decreases the performance of polarity labelling. In particular, removing the constraint on 
review-level sentiment orientation ($\lambda_1$) or the constraint on the general sentiment 
lexicon ($\lambda_2$) decreases the performance the most, which implies that these two 
information sources are of great importance in constructing the sentiment lexicon. 

We further investigate the effect of different constraints by fixing three parameters 
to 1 and varying the weight of the remaining parameter. The experimental results on the 
restaurant dataset are shown in Figure 4, and the observations on the mp3 player dataset 
are similar. 

The experimental results show that giving more weight to the constraints on 
review-level sentiment orientation and the general sentiment lexicon could further improve 
the performance, which means that these two information sources might be more reliable. 
However, weighting the constraint on sentential sentiment consistency too heavily 
decreases the performance. This implies that noise could be introduced by this 
heuristic, and that it is not as reliable as the linguistic heuristics of "and" and "but". 

Figure 4 

Tuning one of the parameters in the range of 0 to 8, doubling its value at each step, while 
fixing the remaining parameters to 1 (x-axis: parameter values). 


We tuned the parameters carefully to obtain the optimal performance. On the mp3 
player dataset, the best result was achieved with the parameters (4, 2, 1, 0.25), giving 
an F-measure of 0.8237; on the restaurant review dataset, the parameters (3, 2, 2, 0.5) 
give an F-measure of 0.8584. 

6. Conclusions 


Treating the numerical star rating as an indication of the sentiment of the review text is 
a widely used assumption in previous phrase-level sentiment analysis algorithms (Lu et al. 
2011; Lu, Zhai, and Sundaresan 2009; Dave, Lawrence, and Pennock 2003). In this paper, 
however, we investigated the inconsistency between numerical ratings and textual 
sentiment orientations. Our user rating analysis shows that users tend 
to give biased ratings regardless of the textual reviews they write about a specific 
product. In addition, the evaluation of labelling accuracy using different methods 
further verifies the existence of such a bias, and indicates the effectiveness of leveraging 
review-level sentiment classification to recover the sentiment orientation of the reviews. 

The biased assumption may hurt the performance of phrase-level sentiment polarity 
labelling to a large extent, because the numerical rating is usually incorporated as 
a kind of ground truth to supervise the model learning process in previous work. In 
this paper, we instead attempt to bridge the gap between review-level and phrase-level 
sentiment analysis by leveraging review-level sentiment classification to boost the 
performance of phrase-level sentiment polarity labelling. 

Specifically, we formalized the phrase-level sentiment polarity labelling problem 
as a simple convex optimization framework, and incorporated four kinds of heuristics 
to supervise the polarity labelling process. We further designed iterative optimization 
algorithms for model learning, where the global optimal solution is guaranteed due to 
the convexity of the objective function. Moreover, beyond the four kinds of heuristics 
investigated in this paper, the framework is flexible enough to integrate various other 
information sources. 

We conducted extensive experiments in two different language environments (English 
and Chinese) to investigate the performance of our framework, as well as its 
portability across different language settings. The experimental results on both 
datasets show that our framework helps to improve the performance in contextual 
sentiment lexicon construction tasks. Besides, the experiment on parameter analysis 
shows that all of the four heuristics that we considered in this study contribute to the 


21 

























Computational Linguistics 


Volume 1, Number 1 


improvement in the performance of polarity labelling, which is also in accordance with 
previous studies. 

In the future, we would like to further investigate the effect of incorporating other
heuristics into the framework for polarity labelling. It would also be interesting
to further bridge the gap between review- and phrase-level sentiment analysis by
integrating the two stages into a single unified framework through, for example, deep
learning techniques. Beyond the sentiment polarity labelling task investigated in this
work, review-level analysis could also help extract feature or opinion
words in phrase-level sentiment analysis, and the joint consideration of review- and
phrase-level analysis may even lead to brand new sentiment analysis tasks.

Appendix

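In the objective function (3), let $C = \lambda_3 D + \lambda_4 D^s - \lambda_3 W^a - \lambda_4 W^s$, let $\Lambda$ be the
Lagrange multiplier for the constraint $X \ge 0$, and let $L(X)$ be the Lagrange function,
then we have:

$$\nabla_X L(X) = 2\lambda_1 A^T A X - 2\lambda_1 A^T \tilde{X} + 2\lambda_2 G(X - X_0) + 2CX - 2\lambda_3 W^b X E - \Lambda \tag{15}$$

By setting $\nabla_X L(X) = 0$, we have:

$$\Lambda = 2\lambda_1 A^T A X - 2\lambda_1 A^T \tilde{X} + 2\lambda_2 G(X - X_0) + 2CX - 2\lambda_3 W^b X E \tag{16}$$

According to the Karush-Kuhn-Tucker (KKT) complementary condition (Boyd and
Vandenberghe 2004) on the non-negativity constraint on $X$, we have $\Lambda_{ij} \cdot X_{ij} = 0$,
namely:

$$\left[\lambda_1 A^T A X - \lambda_1 A^T \tilde{X} + \lambda_2 G(X - X_0) + CX - \lambda_3 W^b X E\right]_{ij} \cdot X_{ij} = 0 \tag{17}$$

Equation (17) can be further transformed into:

$$\begin{aligned}
\big[&-(\lambda_1 A^T \tilde{X} + \lambda_2 G X_0 + \lambda_3 W^a X + \lambda_3 W^b X E + \lambda_4 W^s X) \\
&+ (\lambda_1 A^T A X + \lambda_2 G X + \lambda_3 D X + \lambda_4 D^s X)\big]_{ij} \cdot X_{ij} = 0
\end{aligned} \tag{18}$$

which leads to the updating rule of $X$ as follows:

$$X_{ij} \leftarrow X_{ij} \sqrt{\frac{\left[\lambda_1 A^T \tilde{X} + \lambda_2 G X_0 + \lambda_3 W^a X + \lambda_3 W^b X E + \lambda_4 W^s X\right]_{ij}}{\left[\lambda_1 A^T A X + \lambda_2 G X + \lambda_3 D X + \lambda_4 D^s X\right]_{ij}}} \tag{19}$$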


The correctness and convergence of the updating rule can be proved using the
standard auxiliary function approach presented by Lee and Seung (2001).
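
As an illustration only, the updating rule in Equation (19) could be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function name `update_x`, the argument names, the matrix shapes, the choice of `E` as a 2x2 column-swap matrix, the square-root form of the rule, and the small `eps` guard are all assumptions introduced here; `X_tilde` stands for the review-level matrix in the $\lambda_1$ term.

```python
import numpy as np

# Assumed: E swaps the positive and negative polarity columns of X.
SWAP = np.array([[0.0, 1.0], [1.0, 0.0]])


def update_x(X, A, X_tilde, G, X0, Wa, Wb, Ws, D, Ds, lams, E=SWAP, eps=1e-12):
    """Apply one multiplicative update of Equation (19) and return the new X.

    Shapes are assumptions for illustration: X and X0 are (m, 2); A is (n, m);
    X_tilde is (n, 2); G, Wa, Wb, Ws, D and Ds are (m, m).
    """
    l1, l2, l3, l4 = lams
    # Numerator: the terms that enter the gradient with a negative sign.
    numer = (l1 * (A.T @ X_tilde) + l2 * (G @ X0)
             + l3 * (Wa @ X) + l3 * (Wb @ X @ E) + l4 * (Ws @ X))
    # Denominator: the terms that enter the gradient with a positive sign.
    denom = (l1 * (A.T @ A @ X) + l2 * (G @ X)
             + l3 * (D @ X) + l4 * (Ds @ X))
    # eps guards against division by zero; it is not part of Equation (19).
    return X * np.sqrt(numer / (denom + eps))


# Toy check with only the review-level term active (A = I, lambda_1 = 1):
# the rule then reduces to X <- sqrt(X * X_tilde) element-wise.
m = 3
Z = np.zeros((m, m))
X = np.ones((m, 2))
X_tilde = np.full((m, 2), 4.0)
X_new = update_x(X, np.eye(m), X_tilde, Z, np.zeros((m, 2)), Z, Z, Z, Z, Z,
                 lams=(1.0, 0.0, 0.0, 0.0))
```

Because every factor in the update is non-negative, an $X$ initialized with non-negative entries stays non-negative after each step, so the constraint $X \ge 0$ needs no separate projection. In the toy setting above, repeated calls drive $X$ towards `X_tilde`, the minimizer of the single active term.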


References 

Bickerstaffe, Adrian and Ingrid Zukerman. 2010. A
Hierarchical Classifier Applied to Multi-way Sentiment Detection. Proceedings of the 23rd
International Conference on Computational Linguistics (Coling), pages 62-70.

Boyd, S. and L. Vandenberghe. 2004. Convex Optimization.
Cambridge University Press. 

Chang, Pichuan, Huihsin Tseng, Dan Jurafsky, and Christopher D. Manning.
2009. Discriminative Reordering with Chinese Grammatical Relations Features. Proceedings of 
the Third Workshop on Syntax and Structure in Statistical Translation (SSST), pages 51-59. 

Cui, Hang, Vibhu Mittal, and Mayur Datar. 2006. Comparative
Experiments on Sentiment Classification for Online Product Reviews. Proceedings of the 21st
National Conference on Artificial Intelligence (AAAI), 2:1265-1270.

Dasgupta, Sajib and Vincent Ng. 2009. Mine the Easy, Classify the Hard:
A Semi-Supervised Approach to Automatic Sentiment Classification. Proceedings of the 47th 
Annual Meeting of the Association for Computational Linguistics (ACL), 2:701-709. 

Dave, Kushal, Steve Lawrence, and David M. Pennock.
2003. Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product
Reviews. WWW, pages 519-528.

Ding, Xiaowen, Bing Liu, and Philip S. Yu. 2008. A Holistic
Lexicon-Based Approach to Opinion Mining. Proceedings of the 2008 International Conference on 
Web Search and Data Mining (WSDM), pages 231-239. 

Goldberg, Andrew B. and Xiaojin Zhu. 2006. Seeing stars when there
aren't many stars: Graph-based Semi-supervised Learning for Sentiment Categorization. 
Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, pages 
45-52. 

Hu, Minqing and Bing Liu. 2004. Mining and Summarizing Customer
Reviews. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (KDD), pages 168-177.

Hu, Xia, Jiliang Tang, Huiji Gao, and Huan Liu. 2013. Unsupervised Sentiment
Analysis with Emotional Signals. WWW, pages 607-617. 

Jansen, Bernard J., Mimi Zhang, Kate Sobel, and Abdur Chowdury. 2009.
Micro-blogging as Online Word of Mouth Branding. Proceedings of the 2009 International 
Conference on Human Factors in Computing Systems (CHI), pages 3859-3864. 

Kanayama, Hiroshi and Tetsuya Nasukawa. 2006. Fully
Automatic Lexicon Expansion for Domain-oriented Sentiment Analysis. Proceedings of the 2006 
Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355-363. 

Lee, Daniel D. and H. Sebastian Seung. 2001. Algorithms for Non-negative
Matrix Factorization. Proceedings of the Neural Information Processing Systems (NIPS), pages 
556-562. 

Levy, Roger and Christopher D. Manning. 2003. Is it harder to parse
Chinese, or the Chinese Treebank? Proceedings of the 41st Annual Meeting of the Association for
Computational Linguistics (ACL), pages 439-446.

Li, Shoushan, Zhongqing Wang, Guodong Zhou, and Sophia Yat Mei Lee. 2011.
Semi-Supervised Learning for Imbalanced Sentiment Classification. Proceedings of the 22nd 
International Joint Conference on Artificial Intelligence (IJCAI), 3:1826-1831. 

Lin, Chenghua, Yulan He, and Richard Everson. 2010. A
Comparative Study of Bayesian Models for Unsupervised Sentiment Detection. Proceedings of 
the 14th Conference on Computational Natural Language Learning (CoNLL), pages 144-152. 

Liu, Bing. 2010. Sentiment Analysis and Subjectivity. Handbook of Natural Language
Processing, Chapman and Hall/CRC, 2nd edition.

Liu, Bing, Minqing Hu, and Junsheng Cheng. 2005. Opinion Observer:
Analyzing and Comparing Opinions on the Web. WWW, pages 342-351. 

Liu, Bing and Lei Zhang. 2012. A Survey of Opinion Mining and Sentiment
Analysis. Mining Text Data, pages 415-463. 

Liu, Jingjing, Stephanie Seneff, and Victor Zue. 2010.
Dialogue-Oriented Review Summary Generation for Spoken Dialogue Recommendation 
Systems. Proceedings of the 2010 Annual Conference of the North American Chapter of the 
Association for Computational Linguistics (NAACL), pages 64-72. 


Lu, Yue, Malu Castellanos, Umeshwar Dayal, and ChengXiang Zhai. 2011.
Automatic Construction of a Context-Aware Sentiment Lexicon: An Optimization Approach. 
WWW, pages 347-356. 

Lu, Yue, ChengXiang Zhai, and Neel Sundaresan. 2009. Rated
Aspect Summarization of Short Comments. WWW, pages 131-140. 

M. Marneffe, B. MacCartney, and C. Manning. 2006. Generating Typed
Dependency Parses from Phrase Structure Parses. Proceedings of the 5th International Conference 
on Language Resources and Evaluation (LREC), pages 449-454. 

Maas, Andrew L., Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng,
and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. Proceedings of the
49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 142-150.

Moghaddam, Samaneh and Martin Ester. 2013. Opinion Mining in
Online Reviews: Recent Trends. Tutorial at the 22nd International Conference on World Wide Web.

Mullen, Tony and Nigel Collier. 2004. Sentiment Analysis using
Support Vector Machines with Diverse Information Sources. Proceedings of the 2004 Conference 
on Empirical Methods in Natural Language Processing (EMNLP), pages 412-418. 

Nakagawa, Tetsuji, Kentaro Inui, and Sadao Kurohashi.
2010. Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables. 
Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for 
Computational Linguistics (NAACL), pages 786-794. 

Orimaye, Sylvester O., Saadat M. Alhashmi, and Eu Gene
Siew. 2013. Performance and Trends in Recent Opinion Retrieval Techniques. The Knowledge 
Engineering Review, pages 1-30. 

Pang, Bo and Lillian Lee. 2008. Opinion Mining and Sentiment Analysis.
Foundations and Trends in Information Retrieval, 2(1-2):1-135. 

Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002.
Thumbs up? Sentiment Classification using Machine Learning Techniques. Proceedings of the 
2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79-86. 

Popescu, Ana Maria and Oren Etzioni. 2005. Extracting Product
Features and Opinions from Reviews. Proceedings of the 2005 Conference on Empirical Methods in 
Natural Language Processing (EMNLP), pages 339-346. 

Qiu, Likun, Weishi Zhang, Changjian Hu, and Kai Zhao. 2009. SELC: A
Self-Supervised Model for Sentiment Classification. Proceedings of the 18th ACM Conference on 
Information and Knowledge Management (CIKM), pages 929-936. 

Taboada, Maite, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred
Stede. 2011. Lexicon-Based Methods for Sentiment Analysis. Computational Linguistics,
37(2):267-307.

Tan, Yunzhi, Yongfeng Zhang, Min Zhang, Yiqun Liu, and Shaoping Ma. 2013. A
Unified Framework for Emotional Elements Extraction Based on Finite State Matching 
Machine. Natural Language Processing and Chinese Computing (NLP&CC), pages 60-71. 

Turney, Peter D. 2002. Thumbs Up or Thumbs Down? Semantic Orientation
Applied to Unsupervised Classification of Reviews. Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics (ACL), pages 417-424.

Wang, Hongning, Yue Lu, and ChengXiang Zhai. 2010. Latent Aspect
Rating Analysis on Review Text Data: A Rating Regression Approach. KDD, pages 783-792. 

Wiebe, Janyce, Theresa Wilson, and Claire Cardie. 2005.
Annotating Expressions of Opinions and Emotions in Language. Language Resources and 
Evaluation, 39:165-210. 

Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. 2005.
Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proceedings of the 2005 
Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 347-354. 

Yessenalina, Ainur, Yisong Yue, and Claire Cardie. 2010.
Multi-level Structured Models for Document-level Sentiment Classification. Proceedings of the 
2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1046-1056. 

Zagibalov, Taras and John Carroll. 2008a. Automatic Seed Word
Selection for Unsupervised Sentiment Classification of Chinese Text. Proceedings of the 22nd
International Conference on Computational Linguistics (Coling), pages 1073-1080.

Zagibalov, Taras and John Carroll. 2008b. Unsupervised
Classification of Sentiment and Objectivity in Chinese Text. IJCNLP, pages 304-311. 


Zhang, Yongfeng. 2015. Incorporating Phrase-level Sentiment Analysis on Textual
Reviews for Personalized Recommendation. Proceedings of the 8th ACM International Conference
on Web Search and Data Mining (WSDM).

Zhang, Yongfeng, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and
Shaoping Ma. 2014a. Explicit Factor Models for Explainable Recommendation based on
Phrase-level Sentiment Analysis. Proceedings of the 37th International ACM SIGIR Conference on
Research & Development in Information Retrieval (SIGIR), pages 83-92.

Zhang, Yongfeng, Haochen Zhang, Min Zhang, Yiqun Liu, and Shaoping Ma.
2014b. Do Users Rate or Review? Boost Phrase-level Sentiment Labeling with Review-level
Sentiment Classification. Proceedings of the 37th International ACM SIGIR Conference on Research
& Development in Information Retrieval (SIGIR), pages 1027-1030.

Zhang, Yongfeng, Min Zhang, Yi Zhang, Guokun Lai, Yiqun Liu, and
Shaoping Ma. 2015. Daily-Aware Personalized Recommendation based on Feature-Level Time
Series Analysis. Proceedings of the 24th International Conference on World Wide Web (WWW).

Zhou, Shusen, Qingcai Chen, and Xiaolong Wang. 2010. Active
Deep Networks for Semi-Supervised Sentiment Classification. Proceedings of the 23rd
International Conference on Computational Linguistics (Coling), pages 1515-1523.

